Method for sequencing a direct repeat

ABSTRACT

Described herein is a method of sequencing a template that comprises a direct repeat, comprising: (a) in the same reaction, hybridizing a primer to a first site that is upstream of the first repeat sequence and hybridizing a primer to a second site that is upstream of the second repeat sequence, wherein the first and second sites are: (i) upstream of the first and second repeat sequences, respectively, and (ii) equidistant from the first and second repeat sequences; and (b) subjecting the hybridization product of (a) to a sequencing-by-synthesis sequencing reaction to produce a sequence read that comprises a combination of the first and second repeat sequences.

CROSS-REFERENCING

This application claims the benefit of U.S. Provisional Application Serial No. 62/818,527, filed on Mar. 14, 2019, which application is incorporated by reference herein.

BACKGROUND

Some sequencing methods require comparing two sequences within a single sequence read to determine if there is a difference between the sequences. However, such methods can be challenging to perform because the software that performs this task needs to accurately identify the beginnings and ends of the sequences in a sequence read that should be compared, extract the sequences that should be compared, and then perform an alignment of those sequences. These steps can be challenging to perform automatically and consistently for all different sequences, sequence compositions and lengths. For example, the existence of repeated sequences within a sequence read can cause slippage of an alignment, which may produce erroneous results.

The present disclosure provides an alternative, better way of comparing sequences within the same sequence read.

SUMMARY

A method of sequencing a template that comprises a direct repeat, i.e., a template comprising a first repeat sequence and a second repeat sequence that is in direct orientation with the first repeat sequence, is provided. In some embodiments, the method may comprise, in the same reaction, hybridizing a primer to a first site that is upstream of the first repeat sequence and hybridizing a primer to a second site that is upstream of the second repeat sequence. In these embodiments the first and second sites (i.e., the sites to which the first and second primers bind) should be upstream of the first and second repeat sequences, respectively (i.e., the repeat sequences are downstream from the 3′ ends of the primers) and equidistant from the first and second repeat sequences. The hybridization product produced by this step contains the template with two primers annealed to it, both upstream of a repeat sequence by the same distance (e.g., the same number of bases). Next, the method involves sequencing the template using a sequencing-by-synthesis method (e.g., using fluorescent dye terminators) to produce a sequence read that comprises a combination of the first and second repeat sequences, i.e., a sequence read that is essentially two reads (one from the first primer and the other from the second primer) that are merged with one another. Differences between the sequences of the first and second repeats can be identified as low-quality base calls.
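
By way of illustration only, the equidistance requirement described above can be checked computationally. The following is a minimal sketch, not part of the disclosed method: the function name `primer_offsets` and the example sequences are hypothetical, and the template is modeled as the read-sense strand written 5′ to 3′, so that each primer is represented by the sequence of the site it occupies.

```python
def primer_offsets(template, primer1, primer2, repeat1, repeat2):
    """Return the gap, in bases, between each primer's 3' end and its repeat.

    Assumption: all sequences are given on the read-sense strand (5' to 3'),
    with primer1, repeat1, primer2 and repeat2 occurring in that order.
    """
    p1_end = template.index(primer1) + len(primer1)
    r1_start = template.index(repeat1, p1_end)
    # Look for the second primer only after the end of the first repeat.
    p2_end = template.index(primer2, r1_start + len(repeat1)) + len(primer2)
    r2_start = template.index(repeat2, p2_end)
    return r1_start - p1_end, r2_start - p2_end


# Hypothetical template: primer site, 2-base spacer, repeat, twice over.
template = "AAGGCC" + "TT" + "ACGTACGT" + "GGTTAACC" + "TT" + "ACGTACGT"
gap1, gap2 = primer_offsets(template, "AAGGCC", "GGTTAACC", "ACGTACGT", "ACGTACGT")
print(gap1, gap2, "equidistant" if gap1 == gap2 else "not equidistant")
```

If the two gaps differ, the two extension products would be out of register and every cycle would compare non-corresponding positions of the repeats.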

In some embodiments, within each template molecule the first repeat sequence and the second repeat sequence are amplified from opposite strands of a double-stranded fragment of DNA. In these embodiments, the sequences of the first and second repeats should be identical except for positions that correspond to damaged nucleotides in the double-stranded fragment of DNA or errors that occur during amplification. Thus, any differences between the top and bottom strands of the double-stranded fragment can be identified in the sequence read as a “low quality” base call, i.e., a base that is associated with poor underlying data due to there being, in effect, two different bases at a particular position in the sequence. In more detail, within each template molecule the first repeat may be amplified from one strand of a double-stranded fragment of genomic DNA and the second repeat may be amplified from the other strand of the same double-stranded fragment of genomic DNA. Within a molecule, the sequences of the first and second repeats are often the same. However, in cases where there is damage in the original molecule, the sequences of the first and second repeats (within a single molecule) may differ. As such, within each repeat molecule, the first and second repeats are typically identical except for positions that correspond to (a) damaged nucleotides in the double-stranded fragment of genomic DNA from which those strands were copied or (b) errors that occur during amplification of the direct repeat molecule (e.g., nucleotides that are mis-incorporated or deletions caused by a stutter or slippage event during amplification). As such, the first and second repeats are typically at least 95% identical in sequence. Thus, the different repeats in a template molecule can be sequenced using two primers (one for each repeat) at the same time to determine if the repeats (which correspond to the top and complement of the bottom strands of an initial fragment of genomic DNA) differ. Because two primers are used, the sequences of the first and second repeats are merged in the same sequence read. Any differences between those sequences can be observed as a low-quality base call because the underlying data for that base call are essentially derived from two bases (one base read by the first primer and the other base read by the second primer, where those bases are the same distance downstream from the primers). If there is a low-quality base call at a particular position, then the method may comprise excluding that base call from future analysis. The method may be used to identify damaged nucleotides and amplification errors, as well as sequencing errors (i.e., errors that stem from the sequencing reaction itself, not from the sequencing template).
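
By way of illustration only, the merging behavior described above can be modeled in a few lines. The following is a simplified sketch under stated assumptions: a real base caller derives quality scores from signal intensities, whereas here a position is simply marked "N" (standing in for a low-quality base call) whenever the two primers read different bases. The function names are hypothetical.

```python
def merge_reads(read1, read2):
    """Merge two same-register reads; flag positions where they disagree."""
    if len(read1) != len(read2):
        raise ValueError("reads must cover the same number of cycles")
    merged, low_quality = [], []
    for pos, (b1, b2) in enumerate(zip(read1, read2)):
        if b1 == b2:
            merged.append(b1)        # both primers report the same base
        else:
            merged.append("N")       # mixed signal: treated as low quality
            low_quality.append(pos)
    return "".join(merged), low_quality


def percent_identity(read1, read2):
    """Percentage of positions at which two equal-length reads agree."""
    if len(read1) != len(read2):
        raise ValueError("reads must have equal length")
    matches = sum(b1 == b2 for b1, b2 in zip(read1, read2))
    return 100.0 * matches / len(read1)


merged, flagged = merge_reads("ACGTACGT", "ACGTTCGT")
print(merged, flagged)   # positions in `flagged` may be excluded from analysis
```

Positions returned in the second value correspond to the base calls that the method may exclude from future analysis.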

The method finds particular use in analyzing samples of DNA that contain damaged DNA, samples in which the amount of DNA is limited and/or samples that contain fragments having a low copy number mutation (e.g., a sequence caused by a mutation that is present at low copy number relative to sequences that do not contain the mutation). These features are often present in patient samples that can be obtained non-invasively, e.g., circulating tumor DNA (ctDNA) samples, which can be obtained from peripheral blood, or invasively, e.g., tissue sections. In some embodiments, the sample may be DNA obtained from tissue embedded in paraffin (i.e., an FFPE sample). In such samples, the mutant sequences may only be present at a very limited copy number (e.g., less than 10, less than 5 copies or even 1 copy in a background of hundreds or thousands of copies of the wild type sequence). In these situations, without an effective way to eliminate errors generated by DNA damage, it can be almost impossible to identify a true sequence variation with significant confidence.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention is best understood from the following detailed description when read in conjunction with the accompanying drawings. It is emphasized that, according to common practice, the various features of the drawings are not to scale. Indeed, the dimensions of the various features are arbitrarily expanded or reduced for clarity. Included in the drawings are the following figures.

FIG. 1 schematically illustrates a direct repeat template that has been made from a fragment of double-stranded genomic DNA.

FIG. 2 schematically illustrates where the first and second primers used in the method hybridize to a direct repeat template.

FIG. 3 schematically illustrates an example of the method.

FIG. 4 schematically illustrates an exemplary method by which a direct repeat molecule can be produced.

FIG. 5 schematically illustrates another exemplary method by which a direct repeat molecule can be produced.

DEFINITIONS

Unless defined otherwise herein, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred methods and materials are described.

All patents and publications, including all sequences disclosed within such patents and publications, referred to herein are expressly incorporated by reference.

Numeric ranges are inclusive of the numbers defining the range. Unless otherwise indicated, nucleic acids are written left to right in 5′ to 3′ orientation; amino acid sequences are written left to right in amino to carboxy orientation, respectively.

The headings provided herein are not limitations of the various aspects or embodiments of the invention. Accordingly, the terms defined immediately below are more fully defined by reference to the specification as a whole.

Singleton, et al., DICTIONARY OF MICROBIOLOGY AND MOLECULAR BIOLOGY, 2D ED., John Wiley and Sons, New York (1994), and Hale & Markham, THE HARPER COLLINS DICTIONARY OF BIOLOGY, Harper Perennial, N.Y. (1991) provide one of skill with the general meaning of many of the terms used herein. Still, certain terms are defined below for the sake of clarity and ease of reference.

It is further noted that the claims may be drafted to exclude any optional element. As such, this statement is intended to serve as antecedent basis for use of such exclusive terminology as “solely”, “only” and the like in connection with the recitation of claim elements, or the use of a “negative” limitation.

The term “sample” as used herein relates to a material or mixture of materials, typically containing one or more analytes of interest. In one embodiment, the term, as used in its broadest sense, refers to any plant, animal, microbial or viral material containing genomic DNA, such as, for example, tissue or fluid isolated from an individual (including without limitation plasma, serum, cerebrospinal fluid, lymph, tears, saliva and tissue sections) or from in vitro cell culture constituents, as well as samples from the environment.

The term “nucleic acid sample,” as used herein, denotes a sample containing nucleic acids. Nucleic acid samples used herein may be complex in that they contain multiple different molecules that contain sequences. Genomic DNA samples from a mammal (e.g., mouse or human) are types of complex samples. Complex samples may have more than about 10⁴, 10⁵, 10⁶, 10⁷, 10⁸, 10⁹ or 10¹⁰ different nucleic acid molecules. A DNA target may originate from any source such as genomic DNA, or an artificial DNA construct. Any sample containing nucleic acids, e.g., genomic DNA from tissue culture cells or a sample of tissue, may be employed herein.

The term “mixture” as used herein, refers to a combination of elements that are interspersed and not in any particular order. A mixture is heterogeneous and not spatially separable into its different constituents. Examples of mixtures of elements include a number of different elements that are dissolved in the same aqueous solution and a number of different elements attached to a solid support at random positions (i.e., in no particular order). A mixture is not addressable. To illustrate by example, an array of spatially separated surface-bound polynucleotides, as is commonly known in the art, is not a mixture of surface-bound polynucleotides because the species of surface-bound polynucleotides are spatially distinct, and the array is addressable.

The term “nucleotide” is intended to include those moieties that can be copied using a polymerase. Nucleotides contain not only the known purine and pyrimidine bases, but also other heterocyclic bases that have been modified, e.g., “damaged” bases that have been oxidized or deadenylated, for example. Such modifications include methylated purines or pyrimidines, acylated purines or pyrimidines, alkylated riboses or other heterocycles. In addition, the term “nucleotide” includes those moieties that contain hapten or fluorescent labels and may contain not only conventional ribose and deoxyribose sugars, but other sugars as well. Modified nucleosides or nucleotides also include modifications on the sugar moiety, e.g., wherein one or more of the hydroxyl groups are replaced with halogen atoms or aliphatic groups, or are functionalized as ethers, amines, or the like.

The terms “nucleic acid” and “polynucleotide” are used interchangeably herein to describe a polymer of any length, e.g., greater than about 2 bases, greater than about 10 bases, greater than about 100 bases, greater than about 500 bases, greater than 1000 bases, greater than 10,000 bases, greater than 100,000 bases, greater than about 1,000,000, up to about 10¹⁰ or more bases composed of nucleotides, e.g., deoxyribonucleotides or ribonucleotides, and may be produced enzymatically or synthetically (e.g., PNA as described in U.S. Pat. No. 5,948,902 and the references cited therein) which can hybridize with naturally occurring nucleic acids in a sequence specific manner analogous to that of two naturally occurring nucleic acids, e.g., can participate in Watson-Crick base pairing interactions. Naturally-occurring nucleotides include guanine, cytosine, adenine, thymine, uracil (G, C, A, T and U respectively). DNA and RNA have a deoxyribose and ribose sugar backbone, respectively, whereas PNA’s backbone is composed of repeating N-(2-aminoethyl)-glycine units linked by peptide bonds. In PNA various purine and pyrimidine bases are linked to the backbone by methylenecarbonyl bonds. A locked nucleic acid (LNA), often referred to as inaccessible RNA, is a modified RNA nucleotide. The ribose moiety of an LNA nucleotide is modified with an extra bridge connecting the 2′ oxygen and 4′ carbon. The bridge “locks” the ribose in the 3′-endo (North) conformation, which is often found in the A-form duplexes. LNA nucleotides can be mixed with DNA or RNA residues in the oligonucleotide whenever desired. The term “unstructured nucleic acid,” or “UNA,” is a nucleic acid containing non-natural nucleotides that bind to each other with reduced stability. For example, an unstructured nucleic acid may contain a G′ residue and a C′ residue, where these residues correspond to non-naturally occurring forms, i.e., analogs, of G and C that base pair with each other with reduced stability, but retain an ability to base pair with naturally occurring C and G residues, respectively. Unstructured nucleic acid is described in US20050233340, which is incorporated by reference herein for disclosure of UNA.

The term “oligonucleotide” as used herein denotes a single-stranded multimer of nucleotides of from about 2 to 200 nucleotides, up to 500 nucleotides in length. Oligonucleotides may be synthetic or may be made enzymatically, and, in some embodiments, are 30 to 150 nucleotides in length. Oligonucleotides may contain ribonucleotide monomers (i.e., may be oligoribonucleotides) or deoxyribonucleotide monomers, or both ribonucleotide monomers and deoxyribonucleotide monomers. An oligonucleotide may be 10 to 20, 21 to 30, 31 to 40, 41 to 50, 51 to 60, 61 to 70, 71 to 80, 80 to 100, 100 to 150 or 150 to 200 nucleotides in length, for example.

“Primer” means an oligonucleotide, either natural or synthetic, that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and being extended from its 3′ end along the template so that an extended duplex is formed. The sequence of nucleotides added during the extension process is determined by the sequence of the template polynucleotide. Usually primers are extended by a DNA polymerase. Primers are generally of a length compatible with their use in synthesis of primer extension products and are usually in the range of between 8 to 100 nucleotides in length, such as 10 to 75, 15 to 60, 15 to 40, 18 to 30, 20 to 40, 21 to 50, 22 to 45, 25 to 40, and so on. Typical primers can be in the range of between 10-50 nucleotides long, such as 15-45, 18-40, 20-30, 21-25 and so on, and any length between the stated ranges. In some embodiments, the primers are usually not more than about 10, 12, 15, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, or 70 nucleotides in length. In some embodiments a primer can be activated prior to primer extension. For example, some primers have a 3′ block and an internal RNA base. The RNA base can be removed by RNaseH or another treatment, thereby producing a 3′ hydroxyl group which can be extended. Other methods for activating primers exist.

Primers are usually single-stranded for maximum efficiency in amplification but may alternatively be double-stranded or partially double-stranded. If double-stranded, the primer is usually first treated to separate its strands before being used to prepare extension products. This denaturation step is typically effected by heat, but may alternatively be carried out using alkali, followed by neutralization. Also included in this definition are toehold exchange primers, as described in Zhang et al (Nature Chemistry 2012 4: 208-214), which is incorporated by reference herein.

Thus, a “primer” is complementary to a template, and complexes by hydrogen bonding or hybridization with the template to give a primer/template complex for initiation of synthesis by a polymerase, which is extended by the addition of covalently bonded bases linked at its 3′ end complementary to the template in the process of DNA synthesis.

The term “hybridization” or “hybridizes” refers to a process in which a region of a nucleic acid strand anneals to and forms a stable duplex, either a homoduplex or a heteroduplex, under normal hybridization conditions with a second complementary nucleic acid strand and does not form a stable duplex with unrelated nucleic acid molecules under the same normal hybridization conditions. The formation of a duplex is accomplished by annealing two complementary nucleic acid strand regions in a hybridization reaction. The hybridization reaction can be made to be highly specific by adjustment of the hybridization conditions (often referred to as hybridization stringency) under which the hybridization reaction takes place, such that two nucleic acid strands will not form a stable duplex, e.g., a duplex that retains a region of double-strandedness under normal stringency conditions, unless the two nucleic acid strands contain a certain number of nucleotides in specific sequences which are substantially or completely complementary. “Normal hybridization or normal stringency conditions” are readily determined for any given hybridization reaction. See, for example, Ausubel et al., Current Protocols in Molecular Biology, John Wiley & Sons, Inc., New York, or Sambrook et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory Press. As used herein, the term “hybridizing” or “hybridization” refers to any process by which a strand of nucleic acid binds with a complementary strand through base pairing.

A nucleic acid is considered to be “selectively hybridizable” to a reference nucleic acid sequence if the two sequences specifically hybridize to one another under moderate to high stringency hybridization and wash conditions. Moderate and high stringency hybridization conditions are known (see, e.g., Ausubel, et al., Short Protocols in Molecular Biology, 3rd ed., Wiley & Sons 1995 and Sambrook et al., Molecular Cloning: A Laboratory Manual, Third Edition, 2001 Cold Spring Harbor, N.Y.). One example of high stringency conditions includes hybridization at about 42° C. in 50% formamide, 5X SSC, 5X Denhardt’s solution, 0.5% SDS and 100 µg/ml denatured carrier DNA followed by washing two times in 2X SSC and 0.5% SDS at room temperature and two additional times in 0.1X SSC and 0.5% SDS at 42° C.

The term “amplifying” as used herein refers to the process of synthesizing nucleic acid molecules that are complementary to one or both strands of a template nucleic acid. Amplifying a nucleic acid molecule may include denaturing the template nucleic acid, annealing primers to the template nucleic acid at a temperature that is below the melting temperatures of the primers, and enzymatically elongating from the primers to generate an amplification product. The denaturing, annealing and elongating steps each can be performed one or more times. In certain cases, the denaturing, annealing and elongating steps are performed multiple times such that the amount of amplification product is increasing, oftentimes exponentially, although exponential amplification is not required by the present methods. Amplification typically requires the presence of deoxyribonucleoside triphosphates, a DNA polymerase enzyme and an appropriate buffer and/or co-factors for optimal activity of the polymerase enzyme. The term “amplification product” refers to the nucleic acids that are produced from the amplifying process as defined herein.

The terms “determining,” “measuring,” “evaluating,” “assessing,” “assaying,” and “analyzing” are used interchangeably herein to refer to any form of measurement and include determining if an element is present or not. These terms include both quantitative and/or qualitative determinations. Assessing may be relative or absolute. “Assessing the presence of” includes determining the amount of something present, as well as determining whether it is present or absent.

The term “ligating,” as used herein, refers to the enzymatically catalyzed joining of the terminal nucleotide at the 5′ end of a first DNA molecule to the terminal nucleotide at the 3′ end of a second DNA molecule.

A “plurality” contains at least 2 members. In certain cases, a plurality may have at least 2, at least 5, at least 10, at least 100, at least 1,000, at least 10,000, at least 100,000, at least 10⁶, at least 10⁷, at least 10⁸ or at least 10⁹ or more members.

An “oligonucleotide binding site” refers to a site to which an oligonucleotide hybridizes in a target polynucleotide. If an oligonucleotide “provides” a binding site for a primer, then the primer may hybridize to that oligonucleotide or its complement.

The term “strand” as used herein refers to a nucleic acid made up of nucleotides linked together by covalent bonds, e.g., phosphodiester bonds. In a cell, DNA usually exists in a double-stranded form, and as such, has two complementary strands of nucleic acid referred to herein as the “Watson” (or “top”) and “Crick” (or “bottom”) strands. In certain cases, complementary strands of a chromosomal region may be referred to as “plus” and “minus” strands, the “first” and “second” strands, the “coding” and “noncoding” strands, the “top” and “bottom” strands or the “sense” and “antisense” strands. The assignment of a strand as being a Watson or Crick strand is arbitrary and does not imply any particular orientation, function or structure.

The term “extending”, as used herein, refers to the extension of a primer by the addition of nucleotides using a polymerase. If a primer that is annealed to a nucleic acid is extended, the nucleic acid acts as a template for the extension reaction.

The term “sequencing,” as used herein, refers to a method by which the identity of at least 10 consecutive nucleotides (e.g., the identity of at least 20, at least 50, at least 100 or at least 200 or more consecutive nucleotides) of a polynucleotide is obtained.

The terms “next-generation sequencing” or “high-throughput sequencing”, as used herein, refer to the so-called parallelized sequencing-by-synthesis or sequencing-by-ligation platforms currently employed by Illumina, Life Technologies, and Roche, etc. Next-generation sequencing methods may also include nanopore sequencing methods such as that commercialized by Oxford Nanopore Technologies, electronic-detection based methods such as Ion Torrent technology commercialized by Life Technologies, or single-molecule fluorescence-based methods such as that commercialized by Pacific Biosciences.

The term “barcode sequence” or “molecular barcode”, as used herein, refers to a unique sequence of nucleotides that can be used to a) identify and/or track the source of a polynucleotide in a reaction, b) count how many times an initial molecule is sequenced and c) pair sequence reads from different strands of the same molecule. Barcode sequences may vary widely in size and composition; the following references provide guidance for selecting sets of barcode sequences appropriate for particular embodiments: Casbon (Nuc. Acids Res. 2011, 22 e81), Brenner, U.S. Pat. No. 5,635,400; Brenner et al., Proc. Natl. Acad. Sci., 97: 1665-1670 (2000); Shoemaker et al., Nature Genetics, 14: 450-456 (1996); Morris et al., European patent publication 0799897A1; Wallace, U.S. Pat. No. 5,981,179; and the like. In particular embodiments, a barcode sequence may have a length in range of from 2 to 36 nucleotides, or from 6 to 30 nucleotides, or from 8 to 20 nucleotides.

In some cases, a barcode may contain a “degenerate base region” or “DBR”, where the terms “degenerate base region” and “DBR” refer to a type of molecular barcode that has complexity that is sufficient to help one distinguish between fragments to which the DBR has been added. In some cases, substantially every tagged fragment may have a different DBR sequence. In these embodiments, a high complexity DBR may be used (e.g., one that is composed of at least 10,000 or 100,000, or more sequences). In other embodiments, some fragments may be tagged with the same DBR sequence, but those fragments can still be distinguished by the combination of i. the DBR sequence, ii. the sequence of the fragment, iii. the sequence of the ends of the fragment, and/or iv. the site of insertion of the DBR into the fragment. In some embodiments, at least 95%, e.g., at least 96%, at least 97%, at least 98%, at least 99% or at least 99.5% of the target polynucleotides become associated with a different DBR sequence. In some embodiments, a DBR may comprise one or more (e.g., at least 2, at least 3, at least 4, at least 5, or 5 to 30 or more) nucleotides selected from R, Y, S, W, K, M, B, D, H, V, N (as defined by the IUPAC code). In some cases, a double-stranded barcode can be made by making an oligonucleotide containing degenerate sequence (e.g., an oligonucleotide that has a run of 2-10 or more “Ns”) and then copying the complement of the barcode onto the other strand, as described below.
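
By way of illustration only, the theoretical complexity of a DBR written in IUPAC code is the product of the number of bases permitted at each position. The following minimal sketch computes that value; the function name `dbr_complexity` is hypothetical, and the result is an upper bound that assumes every permitted base is actually represented in the synthesized population.

```python
# Number of bases (A, C, G, T) represented by each IUPAC nucleotide code.
IUPAC_DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def dbr_complexity(dbr):
    """Return the number of distinct sequences encoded by a degenerate region."""
    total = 1
    for code in dbr.upper():
        total *= IUPAC_DEGENERACY[code]   # KeyError for non-IUPAC characters
    return total


print(dbr_complexity("NNNNNNNN"))   # a run of eight Ns
print(dbr_complexity("NNRY"))
```

Under this model, a run of seven “Ns” (4⁷ = 16,384 sequences) is the shortest fully degenerate run that exceeds the 10,000-sequence figure mentioned above.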

Oligonucleotides that contain a variable sequence, e.g., a DBR, can be made by making a number of oligonucleotides separately, mixing the oligonucleotides together, and amplifying them en masse. In other words, the population of oligonucleotides that contain a variable sequence can be made as a single oligonucleotide that contains degenerate positions (i.e., positions that contain more than one type of nucleotide). Alternatively, such a population of oligonucleotides can be made by fabricating them individually or using an array of the oligonucleotides using in situ synthesis methods, cleaving the oligonucleotides from the substrate and optionally amplifying them. Examples of such methods are described in, e.g., Cleary et al. (Nature Methods 2004 1: 241-248) and LeProust et al. (Nucleic Acids Research 2010 38: 2522-2540).

In some cases, a barcode may be error correcting. Descriptions of exemplary error identifying (or error correcting) sequences can be found throughout the literature (e.g., in U.S. Patent Application Publications US2010/0323348 and US2009/0105959, both incorporated herein by reference). Error-correctable codes may be necessary for quantitating absolute numbers of molecules. Many reports in the literature use codes that were originally developed for error-correction of binary systems (Hamming codes, Reed-Solomon codes, etc.) or apply these to quaternary systems (e.g., quaternary Hamming codes; see Generalized DNA barcode design based on Hamming codes, Bystrykh, PLoS One 2012 7: e36852).
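
By way of illustration only, the error-correcting capacity of a set of equal-length barcodes can be estimated from its minimum pairwise Hamming distance d: a set with minimum distance d can correct up to ⌊(d − 1)/2⌋ substitutions per barcode. The following sketch is not taken from the cited references; the function names and the example barcodes are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def correctable_substitutions(barcodes):
    """Substitutions per barcode that nearest-neighbor decoding can correct."""
    return (min_distance(barcodes) - 1) // 2


barcodes = ["AAAAA", "CCCAA", "AACCC", "CCACC"]   # hypothetical set
print(min_distance(barcodes), correctable_substitutions(barcodes))
```

This check covers substitutions only; insertions and deletions would require a different distance measure.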

In some embodiments, a barcode may additionally be used to determine the number of initial target polynucleotide molecules that have been analyzed, i.e., to “count” the number of initial target polynucleotide molecules that have been analyzed. PCR amplification of molecules that have been tagged with a barcode can result in multiple sub-populations of products that are clonally-related in that each of the different sub-populations is amplified from a single tagged molecule. As would be apparent, even though there may be several thousand or millions or more molecules in any of the clonally-related sub-populations of PCR products and the number of target molecules in those clonally-related sub-populations may vary greatly, the number of molecules tagged in the first step of the method can be estimated by counting the number of DBR sequences associated with a target sequence that is represented in the population of PCR products. This number is useful because, in certain embodiments, the population of PCR products made using this method may be sequenced to produce a plurality of sequences. The number of different barcode sequences that are associated with the sequences of a target polynucleotide can be counted, and this number can be used (along with, e.g., the sequence of the fragment, the sequence of the ends of the fragment, and/or the site of insertion of the DBR into the fragment) to estimate the number of initial template nucleic acid molecules that have been sequenced. Such tags can also be useful in correcting sequencing errors.
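
By way of illustration only, the counting step described above amounts to tallying the distinct DBR sequences observed for each target. The following minimal sketch assumes that each sequence read has already been reduced to a (target, DBR) pair; the function name and the target labels are hypothetical, and no correction is made for sequencing errors within the DBR itself.

```python
from collections import defaultdict


def count_initial_molecules(reads):
    """Estimate initial molecules per target as the number of distinct DBRs.

    `reads` is an iterable of (target, dbr) pairs, one pair per sequence read.
    """
    dbrs_by_target = defaultdict(set)
    for target, dbr in reads:
        dbrs_by_target[target].add(dbr)   # clonal duplicates collapse here
    return {target: len(dbrs) for target, dbrs in dbrs_by_target.items()}


reads = [
    ("target_1", "ACGT"),
    ("target_1", "ACGT"),   # PCR duplicate of the read above
    ("target_1", "TTGA"),
    ("target_2", "ACGT"),
]
print(count_initial_molecules(reads))
```

The fragment sequence, fragment ends or DBR insertion site could be added to the set key to distinguish fragments that happen to share a DBR.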

The terms “sample identifier sequence” or “sample index” refer to a type of barcode that can be appended to a target polynucleotide, where the sequence identifies the source of the target polynucleotide (i.e., the sample from which the target polynucleotide is derived). In use, each sample is tagged with a different sample identifier sequence (e.g., one sequence is appended to each sample, where the different samples are appended to different sequences), and the tagged samples are pooled. After the pooled sample is sequenced, the sample identifier sequence can be used to identify the source of the sequences.
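
By way of illustration only, assigning pooled reads back to their source sample can be sketched as a lookup on the sample identifier sequence. The sketch below assumes, purely for simplicity, that the index occupies the first `index_len` bases of each read; on many platforms the index is obtained in a separate read. All names and sequences are hypothetical.

```python
def demultiplex(reads, index_to_sample, index_len):
    """Group reads by sample using a leading sample identifier sequence."""
    by_sample = {}
    for read in reads:
        index, insert = read[:index_len], read[index_len:]
        # Reads whose index matches no known sample are kept separately.
        sample = index_to_sample.get(index, "undetermined")
        by_sample.setdefault(sample, []).append(insert)
    return by_sample


index_to_sample = {"AACC": "sample_A", "GGTT": "sample_B"}
pooled = ["AACCACGTAC", "GGTTTTGACA", "AACCGGGTTT", "CCCCACGTAC"]
print(demultiplex(pooled, index_to_sample, index_len=4))
```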

The term “adaptor” refers to a nucleic acid that can be joined to at least one strand of a double-stranded DNA molecule. As used herein, the term refers to molecules that are at least partially double-stranded. An adaptor may be 20 to 150 bases in length, e.g., 40 to 120 bases, although adaptors outside of this range are envisioned.

The term “adaptor-tagged,” as used herein, refers to a nucleic acid that has been tagged by, i.e., covalently linked with, an adaptor. An adaptor can be joined to a 5′ end and/or a 3′ end of a nucleic acid molecule.

The term “tagged DNA” as used herein refers to DNA molecules that have an added adaptor sequence, i.e., a “tag” of synthetic origin. An adaptor sequence can be added (i.e., “appended”) by ligation.

The term “complexity” refers to the total number of different sequences in a population. For example, if a population has 4 different sequences then that population has a complexity of 4. A population may have a complexity of at least 4, at least 8, at least 16, at least 100, at least 1,000, at least 10,000 or at least 100,000 or more, depending on the desired result.

The term “of the formula” means that the individual molecules in a population are described by, i.e., encompassed by, the formula.

Certain polynucleotides described herein may be referred to by a formula. Unless otherwise indicated the polynucleotides defined by a formula are oriented in the 5′ to 3′ direction. The components of the formula refer to separately definable sequences of nucleotides within a polynucleotide, where, unless implicit from the context, the sequences are linked together covalently such that a polynucleotide described by a formula is a single molecule. In some cases, the components of the formula are immediately adjacent to one another in the single molecule. Unless otherwise indicated or implicit from the context, a region defined by a formula may have additional sequences, a primer binding site, a molecular barcode, a promoter, or a spacer, etc., at its 3′ end, its 5′ end or both the 3′ and 5′ ends. As would be apparent, the various component sequences of a polynucleotide may independently be of any desired length as long as they are capable of performing the desired function (e.g., hybridization to another sequence). For example, the various component sequences of a polynucleotide may independently have a length in the range of 8-80 nucleotides, e.g., 10-50 nucleotides or 12-30 nucleotides.

The term “opposite strands”, as used herein, refers to the top and bottom strands, where the strands are complementary to one another, except for damaged nucleotides.

The term “potential sequence variation”, as used herein, refers to a sequence variation, e.g., a substitution, deletion, insertion or rearrangement of one or more nucleotides in one sequence relative to another.

The term “amplification error” refers to a mis-incorporated base, or a deletion/insertion caused by polymerase stutter. Stutter usually occurs in repeat sequences, e.g., short tandem repeats (STRs) or microsatellite repeats and is presumed to be due to miscopying or slippage by the polymerase.

The term “target enrichment”, as used herein, refers to a method in which selected sequences are separated from other sequences in a sample. This may be done by hybridization to a probe, e.g., hybridizing a biotinylated oligonucleotide to the sample to produce duplexes between the oligonucleotide and the target sequence, immobilizing the duplexes via the biotin group, washing the immobilized duplexes, and then releasing the target sequences from the oligonucleotides. Alternatively, a selected sequence may be enriched by amplifying that sequence, e.g., by PCR using one or more primers that hybridize to a site that is proximal to the target sequence.

The terms “minority variant” and “sequence variation”, as used herein, refer to a variant that is present at a frequency of less than 50%, relative to other molecules in the sample. In some cases, a minority variant may be a first allele of a polymorphic target sequence, where, in a sample, the ratio of molecules that contain the first allele of the polymorphic target sequence compared to molecules that contain other alleles of the polymorphic target sequence is 1:5 or less, 1:10 or less, 1:100 or less, 1:1,000 or less, 1:10,000 or less, 1:100,000 or less or 1:1,000,000 or less.

The term “duplex sequencing” refers to a method in which sequences for both strands of a double-stranded molecule of genomic DNA are obtained. In duplex sequencing, the sequences derived from the top strand of a double-stranded molecule of genomic DNA are distinguishable from sequences derived from the bottom strand of that molecule in such a way that the sequences for the top and bottom strands from the same double-stranded molecule of genomic DNA can be compared.

The term “direct repeat” refers to a molecule that contains two copies of near identical sequences, i.e., sequences that are of the same length and that are at least 95% identical in nucleotide sequence.

The term “distance” as used herein depends on the sequencing-by-synthesis method being used for sequencing. For example, in methods that rely on reversible chain terminators the distance between the 3′ end of a primer and a downstream nucleotide can be defined by the number of bases. In semiconductor or pyrosequencing methods the distance between the 3′ end of a primer and a downstream nucleotide can be defined by the number of flows because, in those methods, several nucleotides can be added in a single flow. Thus, “equidistant” can mean the same number of nucleotides if a reversible chain terminator-based sequencing method is used or the same number of flows if a semiconductor- or pyrosequencing-based sequencing method is used.

For ease of reference, the reverse complement of a sequence may be indicated by the prime (“′”) symbol. For example, the reverse complement of a sequence referred to as “W” may be referred to as “W′”.
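
The difference between counting bases and counting flows can be illustrated with a minimal Python sketch. The function name and the cyclic flow order "TACG" are assumptions made for illustration only; commercial instruments may use other flow orders. The sketch counts how many nucleotide flows are needed to synthesize a given stretch, treating each homopolymer run as being incorporated in a single flow.

```python
def flows_to_extend(synthesized: str, flow_order: str = "TACG") -> int:
    """Count the flows needed to synthesize `synthesized` (written 5'->3').

    Assumption: nucleotides are flowed one at a time in a repeating cycle
    given by `flow_order` (a hypothetical order), and every consecutive
    copy of the flowed base is incorporated within that one flow.
    """
    if not set(synthesized) <= set(flow_order):
        raise ValueError("sequence contains a base that is never flowed")
    flows = 0
    i = 0
    while i < len(synthesized):
        base = flow_order[flows % len(flow_order)]
        flows += 1
        # A homopolymer run of the flowed base is consumed in this flow.
        while i < len(synthesized) and synthesized[i] == base:
            i += 1
    return flows


# Three bases can take fewer flows than two bases, so sites that are
# equidistant in bases are not necessarily equidistant in flows.
print(flows_to_extend("TTA"))  # 2
print(flows_to_extend("AT"))   # 5
```

Under these assumptions, two primer binding sites would be treated as equidistant for a flow-based method only if the stretches between each primer and its repeat require the same number of flows.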

Other definitions of terms may appear throughout the specification.

DETAILED DESCRIPTION OF THE INVENTION

Before the present invention is described, it is to be understood that this invention is not limited to particular embodiments described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present invention will be limited only by the appended claims.

Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limits of that range is also specifically disclosed. Each smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in that stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range, and each range where either, neither or both limits are included in the smaller ranges is also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, some potential and preferred methods and materials are now described. All publications mentioned herein are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited. It is understood that the present disclosure supersedes any disclosure of an incorporated publication to the extent there is a contradiction.

It must be noted that as used herein and in the appended claims, the singular forms “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a nucleic acid” includes a plurality of such nucleic acids and reference to “the compound” includes reference to one or more compounds and equivalents thereof known to those skilled in the art, and so forth.

The practice of the present invention may employ, unless otherwise indicated, conventional techniques and descriptions of organic chemistry, polymer technology, molecular biology (including recombinant techniques), cell biology, biochemistry, and immunology, which are within the skill of the art. Such conventional techniques include polymer array synthesis, hybridization, ligation, and detection of hybridization using a label. Specific illustrations of suitable techniques can be had by reference to the example herein below. However, other equivalent conventional procedures can, of course, also be used. Such conventional techniques and descriptions can be found in standard laboratory manuals such as Genome Analysis: A Laboratory Manual Series (Vols. I-IV), Using Antibodies: A Laboratory Manual, Cells: A Laboratory Manual, PCR Primer: A Laboratory Manual, and Molecular Cloning: A Laboratory Manual (all from Cold Spring Harbor Laboratory Press), Stryer, L. (1995) Biochemistry (4th Ed.) Freeman, New York, Gait, “Oligonucleotide Synthesis: A Practical Approach” 1984, IRL Press, London, Nelson and Cox (2000), Lehninger, A., Principles of Biochemistry 3rd Ed., W. H. Freeman Pub., New York, N.Y. and Berg et al. (2002) Biochemistry, 5th Ed., W. H. Freeman Pub., New York, N.Y., all of which are herein incorporated in their entirety by reference for all purposes.

The publications discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided may be different from the actual publication dates which may need to be independently confirmed.

Provided herein, among other things, is a way to sequence a template that has a direct repeat, i.e., a template that comprises a first repeat sequence and a second repeat sequence, wherein the first and second repeat sequences are in a direct repeat and either identical or nearly identical. In some embodiments, within each template molecule, the first repeat sequence and the second repeat sequence may be amplified from opposite strands of a double-stranded fragment of DNA. In embodiments in which the fragment of DNA is double-stranded genomic DNA (e.g., eukaryotic genomic DNA, which may be isolated from a tissue biopsy or may be cell-free DNA (cfDNA), microbial genomic DNA or viral genomic DNA), the sequences of the repeats may be identical except for positions that correspond to damaged nucleotides in the double-stranded fragment of DNA or errors that occur during amplification. An example of such a direct repeat is illustrated in FIG. 1. As shown, within each repeat molecule the first repeat and the second repeat are amplified from opposite strands of a fragment of double-stranded DNA, e.g., genomic DNA. The first repeat has the same or a very similar sequence as one strand (the top strand of the fragment, for example) of a fragment of double-stranded genomic DNA whereas the second repeat has the same or a very similar sequence as the reverse complement of the other strand of the fragment (e.g., the bottom strand of the fragment). In embodiments in which the fragment is genomic DNA, the first and second repeat sequences should be identical except for nucleotides that correspond to (i.e., are at a position that corresponds to the position of) damaged nucleotides in the fragment of double-stranded genomic DNA or errors that have occurred during amplification. In other embodiments, the double-stranded fragment may be made synthetically or derived from a double-stranded plasmid, for example.

A “damaged nucleotide” refers to any derivative of adenine, cytosine, guanine, and thymine that has been altered in a way that allows it to pair with a different base. In non-damaged DNA, A base pairs with T and C base pairs with G. However, some bases can be oxidized, alkylated or deaminated in a way that affects base pairing. For example, 7,8-dihydro-8-oxoguanine (8-oxo-dG) is a derivative of guanine that base pairs with adenine instead of cytosine. This derivative causes a G to T transversion after replication. Deamination of cytosine produces uracil, which can base pair with adenine, leading to a C to T change after replication. Other examples of damaged nucleotides that are capable of mismatched pairing are known.

Within a direct repeat template molecule, the sequences of the first and second repeats have identical lengths and are at least 95% identical (e.g., at least 95% identical, at least 96% identical, at least 97% identical, at least 98% identical, at least 99% identical or 100% identical, depending on, e.g., the extent of DNA damage in the fragment of double-stranded genomic DNA and/or amplification errors) and, with the exception of nucleotides that correspond to damaged nucleotides and amplification errors, should be identical. As shown, the molecules may have a unit length of 1, meaning that there is only one copy of the first repeat and one copy of the second repeat in each molecule. The template molecules may be single stranded or double stranded. However, as would be appreciated, the template is in its single stranded form when it is being sequenced. The sequence of the first and second repeats may have a length of at least 50 nucleotides and in some embodiments may be in the range of 50 nucleotides to 2 kb in length, e.g., 50-500 nt or 50-300 nt. In some embodiments, the direct repeat template may be in a sample that contains other direct repeat templates. Within the population, the complexity and median length of the sequence of the first repeat may vary and may be approximately the same as the complexity and median length of the sequence of the second repeat, since those sequences are almost identical. In the population, the first repeat and the second repeat may each have a complexity of at least 10³, e.g., at least 10⁴, at least 10⁵, at least 10⁶, at least 10⁷, at least 10⁸, at least 10⁹ or at least 10¹⁰, for example, meaning that in the population, the first repeat and the second repeat are each represented by at least 10³ different sequences. The lengths of the first and second repeats may depend on the lengths of the fragments of DNA in the sample from which the molecules are made. In some embodiments, the fragments may have a median size that is no more than 2 kb in length (e.g., in the range of 50 bp to 2 kb, e.g., 75 bp to 1.5 kb, 100 bp to 1 kb, 100 bp to 500 bp). The lengths of the fragments may be tailored to the sequencing platform being used. Examples of how these molecules can be made will be described in greater detail below.

Therefore, in any embodiment, the direct repeat molecule may be made by copying a double-stranded fragment of DNA to produce the direct repeat molecule, where the first and second repeats of the direct repeat molecule are to be amplified from opposite strands of the double-stranded fragment of DNA.

As noted above, in some embodiments the method may comprise, in the same reaction, hybridizing a primer to a first site that is upstream of the first repeat sequence and hybridizing a primer to a second site that is upstream of the second repeat sequence. In these embodiments, the first and second sites (i.e., the sites to which the first and second primers bind, respectively) are upstream of the first and second repeat sequences, respectively, and equidistant from the first and second repeat sequences. This is illustrated in FIG. 2. As illustrated, the first primer binds to a site that is upstream of (i.e., 3′ to) the first repeat whereas the second primer binds to a site that is upstream of (i.e., 3′ to) the second repeat, where the distances between the primers and their respective repeats are the same. Illustrated by example, if the 3′ end of the first primer hybridizes to a nucleotide that is upstream of (i.e., 3′ to) the first repeat by n bases (where n is in the range of, e.g., 5 to 30) then the 3′ end of the second primer hybridizes to a nucleotide that is upstream of (i.e., 3′ to) the second repeat by n bases. While the distance between the primer binding sites and the repeats can be defined by the number of bases for some sequencing methods (e.g., Illumina’s dye terminator sequencing method), the distance can be defined by “flows” in other methods (e.g., Ion Torrent or pyrosequencing methods).

After hybridization of the primers, the method may comprise subjecting the hybridization product to a sequencing-by-synthesis sequencing reaction to produce a sequence read that comprises a combination of the first and second repeat sequences, meaning that the sequences are merged into one. In some embodiments, sequencing-by-synthesis methods are those that involve extending a primer using a template and detecting which nucleotide is added at each position. Sequencing-by-synthesis methods include, but are not limited to, Illumina’s reversible dye terminator method, Thermo’s Ion Torrent method (which detects ions as they are released by DNA polymerase) and pyrosequencing, although others are known. In the reversible dye terminator approach, the sequence of a template is determined using reversible terminator chemistry (Turcatti et al., Nucleic Acids Res. 2008 36:e25). In every sequencing cycle a single fluorescently labeled, 3′-blocked nucleotide is added in a templated primer extension reaction. After incorporation, the identity of the fluorescent label added is detected by fluorescent imaging. In each round, the labels and terminators are chemically removed in order to prepare the primer extension product for the next cycle. A more detailed description of the process can be found in Bentley, supra.

As noted above, the sequence read produced using this method will be a combination of the first and the second repeat sequences, where the term “combination” is intended to mean that the sequences of the first and second repeats are merged, superimposed or melded into one. By way of example, if the sequence of the first repeat is GATCGGATCGA (SEQ ID NO: 1) and the sequence of the second repeat is GATCGGATCGA (SEQ ID NO: 1), then the sequence read will contain only one copy of the sequence GATCGGATCGA (SEQ ID NO: 1), where some of the signal used to generate the sequence read is generated by extension of the first primer and some of the signal used to generate the sequence read is generated by the extension of the second primer in the same reaction.

Differences in the sequences of the first and second repeats can be identified because the underlying signal corresponding to the difference will be mixed (i.e., will be a composite of signals produced by two different bases at that position). Positions that have a mixed signal can be identified because they are associated with a low-quality base call. As such, differences in the sequences of the first and second repeats can be identified as positions that have a low-quality base call. In these embodiments, the sequence read comprises, for each position of the sequence read, a quality score indicating the reliability of the base(s) called at that position. Base calling is the process by which an order of nucleotides in a template is inferred during a sequencing reaction. For example, next generation sequencing platforms that use fluorescently labeled reversible terminators have a unique color for each base. These are incorporated into the complementary strand of the DNA template and captured with a sensitive CCD camera. These images are processed into signals which are used to infer the order of nucleotides, also known as base calling.

Base calling accuracy can be measured in a variety of different ways. In some embodiments base calling accuracy can be measured using a Q score (Phred quality score), which is a common metric to assess the accuracy of a sequencing run. Q scores are defined as logarithmically related to the base calling error probability P, where Q = -10 log₁₀ P. In this system, if a base is assigned a Q score of 40, this is equal to a probability of an incorrect base call of 1 in 10,000, or 99.99% base calling accuracy; a lower Q score of 10 means that there is a probability of an incorrect call in 1 of 10 bases. Lower Q scores can lead to increases in false positive variant calls and reduce the overall confidence an investigator has in their sequencing data. Details of base calling and methods for calculating the quality of a base call are described in a variety of publications, including, e.g., Ledergerber et al. (Brief Bioinform. 2011 12: 489-497), Whiteford et al. (Bioinformatics 2009 25: 2194-2199), Erlich (Nat. Methods. 2008 5: 679-682) and Kao et al. (Genome Res. 2009 19: 1884-95), which are incorporated by reference for disclosure of those methods.
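
The relationship between a Q score and the error probability can be checked with a short Python sketch. The function names are illustrative assumptions and are not part of any particular base calling software.

```python
import math


def phred_from_error_prob(p: float) -> float:
    """Return the Q score for an error probability p, Q = -10 * log10(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)


def error_prob_from_phred(q: float) -> float:
    """Return the error probability for a Q score, p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


# An error rate of 1 in 10,000 corresponds to Q40; Q10 corresponds to 1 in 10.
print(round(phred_from_error_prob(1e-4)))  # 40
print(error_prob_from_phred(10))           # 0.1
```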

In some embodiments, the method may be used to identify positions that differ in the first and second repeats. In these embodiments, a position in the sequence read that is uncalled or associated with a low-quality score indicates that the first and second repeat sequences differ at a nucleotide that corresponds to that position. By way of example, if the sequence of the first repeat is GATCGGATCGA (SEQ ID NO: 1) and the sequence of the second repeat is GATCGTATCGA (SEQ ID NO: 2), then the sequence read may contain only one copy of the sequence GATCG[G/T]ATCGA (SEQ ID NO: 3), where “G/T” is a base that has a mixed signal and is therefore associated with a poor quality base call. In this example, the quality of the base calls for the non-G/T bases will be high and the quality of the base call for the G/T base will be poor because some of the signal for that position, as analyzed by the base calling algorithm, will be generated by extension of the first primer and some of the signal will be generated by the extension of the second primer, in the same reaction.
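
The merged read in this example can be mimicked in silico with a minimal Python sketch. It is an idealized illustration only: the function name is an assumption, and a real instrument reports mixed positions as low-quality or uncalled bases rather than as bracketed pairs.

```python
from typing import List, Tuple


def merge_repeats(first: str, second: str) -> Tuple[str, List[int]]:
    """Superimpose two equal-length repeat sequences.

    Returns the merged sequence, with differing positions written as
    "[X/Y]", and the 0-based indices of the positions that differ.
    """
    if len(first) != len(second):
        raise ValueError("repeats must have the same length")
    merged = []
    mixed_positions = []
    for pos, (a, b) in enumerate(zip(first, second)):
        if a == b:
            merged.append(a)
        else:
            # Both primers contribute signal here, so the position is mixed.
            merged.append(f"[{a}/{b}]")
            mixed_positions.append(pos)
    return "".join(merged), mixed_positions


read, mixed = merge_repeats("GATCGGATCGA", "GATCGTATCGA")
print(read)   # GATCG[G/T]ATCGA
print(mixed)  # [5]
```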

After a position that has a low-quality base call has been identified (or, in some cases a position that is uncalled), the method may further comprise analyzing the underlying signals for that position to determine the identities of the nucleotides at that position in the first and second repeats. For example, in the example described in the prior paragraph, the underlying signals (i.e., prior to base calling and referred to as primary sequence data) could be analyzed to determine that the position contains a mixture of G and T, thereby indicating that the first repeat contains a G or T at that position, and the second repeat contains the other nucleotide. As such, in any embodiment, the method may comprise reading a combination of signals obtained by simultaneous extension of the first and second primers to produce primary sequencing data, processing the primary sequencing data using a base-calling algorithm to produce a sequence read composed of a sequence of base calls, each base call associated with a quality score indicating the reliability of the base call; and outputting the sequence read based on the quality scores. The quality scores allow differences between the first and second repeats to be identified.
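
One way to picture the analysis of the underlying signals is the following Python sketch. It assumes, purely for illustration, that the primary data for a position are available as one intensity per base and that a base contributes to the mixture if it carries at least a chosen fraction of the total signal. The function name, the data layout and the 0.25 threshold are hypothetical.

```python
from typing import Dict, List


def bases_in_signal(intensities: Dict[str, float],
                    min_fraction: float = 0.25) -> List[str]:
    """List the bases that carry at least `min_fraction` of the total signal.

    `intensities` maps each base to its raw intensity at one position
    (a hypothetical representation of primary sequencing data).
    """
    total = sum(intensities.values())
    if total <= 0:
        return []
    return sorted(base for base, value in intensities.items()
                  if value / total >= min_fraction)


# A position where the two repeats differ shows two strong channels.
print(bases_in_signal({"A": 0.02, "C": 0.01, "G": 0.50, "T": 0.47}))  # ['G', 'T']
# A position where the repeats agree shows one strong channel.
print(bases_in_signal({"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}))  # ['A']
```

As stated above, such an analysis indicates which two nucleotides are present but does not by itself assign each nucleotide to a particular repeat.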

In some embodiments, the first and second sites in the template (i.e., the sequences to which the first and second primers bind) are the same sequence. In these embodiments, a single primer may be used in the method, where the primer binds to two sites in the template. In alternative embodiments, the first and second sites in the template (i.e., the sequences to which the first and second primers bind) may be different sequences. In these embodiments, two or more primers may be used in the method, where the primers bind different sequences in the template, one upstream of the first repeat and the other upstream of the second repeat.

In some embodiments, the method may involve determining how many strands of the first repeat are sequenced relative to the number of strands of the second repeat, or if a sufficient number of molecules have been sequenced. These embodiments may be implemented by adding a calibrator sequence to the template, as shown in FIG. 3. In these embodiments, the template may comprise: a first calibrator sequence that is present between the first site and the first repeat; and a second calibrator sequence that is present between the second site and the second repeat, wherein the first and second calibrator sequences are the same length (e.g., may be two, three or four bases in length or the same number of flows in length, depending on the sequencing method used) and have a different sequence; and the sequence read of step (b) includes positions that correspond to the first and second calibrator sequences. In these embodiments, the underlying signals corresponding to the first and second calibrator sequences (prior to base calling) can be examined to determine how many strands of the first and second repeats are sequenced in the reaction. Likewise, the underlying signals corresponding to the first and second calibrator sequences (prior to base calling) can be examined to determine if a sufficient number of molecules have been sequenced.
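
As a rough illustration of how calibrator signals might be examined, the Python sketch below estimates the fraction of extended strands that originate from the first primer. It assumes that per-base intensities are available for each calibrator position and that intensity is proportional to the number of strands incorporating that base; both assumptions, as well as the function name and data layout, are hypothetical.

```python
from typing import Dict, List


def first_primer_fraction(key1: str, key2: str,
                          signals: List[Dict[str, float]]) -> float:
    """Estimate the share of signal that comes from the first primer.

    `key1` and `key2` are the calibrator sequences read from the first
    and second primers; `signals` holds one base -> intensity mapping
    per calibrator position.
    """
    if not (len(key1) == len(key2) == len(signals)):
        raise ValueError("keys and signals must have the same length")
    fractions = []
    for b1, b2, sig in zip(key1, key2, signals):
        if b1 == b2:
            continue  # identical bases cannot separate the two primers
        s1, s2 = sig.get(b1, 0.0), sig.get(b2, 0.0)
        if s1 + s2 > 0:
            fractions.append(s1 / (s1 + s2))
    if not fractions:
        raise ValueError("no informative calibrator positions")
    return sum(fractions) / len(fractions)


# Two-base calibrators TT and AA with a 60:40 split of signal.
frac = first_primer_fraction("TT", "AA",
                             [{"T": 60.0, "A": 40.0}, {"T": 60.0, "A": 40.0}])
print(round(frac, 2))  # 0.6
```

In a similar spirit, the summed calibrator intensity could be compared against a chosen minimum to judge whether enough molecules were sequenced, although the appropriate minimum would depend on the instrument.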

In many sequencing-by-synthesis methods, template molecules are clonally amplified, and the amplification products are sequenced in a highly parallel fashion. Such methods are reviewed in, e.g., Metzker et al. (Genome Res. 2005 15:1767-1776) and Bentley (Curr. Opin. Genet. Dev. 2006 16: 545-55). In Illumina sequencing the templates are spread in a flow cell and immobilized on a support (typically glass; see Fedurco et al., Nucleic Acids Res. 2006 34:e22), where they are amplified in place by bridge PCR, which generates clusters of identical templates (or “colonies”) on the support. As such, the present method may be implemented by amplifying the template on a substrate by bridge PCR to produce a colony that comprises copies of the template, hybridizing one or more primers to the colony, wherein a primer hybridizes to a first site that is upstream of the first repeat sequence and a primer hybridizes to a second site that is upstream of the second repeat sequence, wherein the first and second sites are: upstream of the first and second repeat sequences, respectively, and equidistant from the first and second repeat sequences; and obtaining the sequence of the template by a sequencing-by-synthesis sequencing reaction to produce a sequence read that comprises a combination of the first and second repeat sequences. In some embodiments (and as illustrated in FIG. 3) the top and bottom strands of the bridge PCR amplification products may be sequenced by Illumina’s sequencing method (which is referred to as “paired end” sequencing). As such, in some embodiments, the sequence of a top strand of a bridge PCR product can be compared to the sequence of a bottom strand of a bridge PCR product. Positions that are associated with a low-quality base call as a result of a difference in sequence between the first and second repeats should have a low-quality base call in both strands. In some embodiments, after sequencing both strands of the product by paired end sequencing one can produce a consensus sequence for the top strand of the initial double-stranded fragment and a consensus sequence for the bottom strand of the initial double-stranded fragment. Low quality bases can be masked or integrated into a model in which the quality scores are taken into account. Sequences that are not present in both the top and bottom strands of the initial double-stranded fragment can thereby be eliminated from future analysis.

FIG. 3 illustrates an example of the method. In this example, the template is a double stranded molecule and one or both strands need to be sequenced (sequencing of the bottom strand is shown). In this example, the direct repeat template has flow cell sequences (e.g., Illumina’s P5 and P7 sequences) at the ends and a primer binding site between the first and second repeats. As shown, this molecule is amplified from a double-stranded fragment, where the first and second repeat sequences (W* and W or C* and C) are amplified from opposite strands of a double-stranded fragment of DNA and are identical except for positions that correspond to damaged nucleotides in the double-stranded fragment of DNA or errors that occur during amplification. As shown, the method may involve hybridizing two primers (designated P₁ and P₂, which can be the same or different) to the template (after it has been amplified). In this embodiment, the repeats each have a calibrator sequence (referred to as “key 1” and “key 2”) that can be used to determine the relative number of copies of the first and second repeats that are sequenced in a reaction. As shown, the part of the sequence read obtained from primer P₁ should contain key 1 (TT) and the part of the sequence read obtained from primer P₂ should contain key 2 (AA). In this example, there is a difference in sequence in the first and second repeats, which can be identified as a base call with a low quality (as a result of the template having a mixed nucleotide at that position).

In embodiments in which there is non-informational sequence immediately downstream of a primer binding site, the primers may be extended but not read for the first few cycles, thereby allowing one to obtain the sequence of the keys and/or repeats faster.

In some embodiments, the direct repeat template may have different, non-complementary sequences (Sequences 1 and 2 in FIG. 3) of at least 10 nucleotides (e.g., at least 10, 12 or 14 nucleotides in length) that allow the fragments to be amplified by a single pair of primers: a first primer that hybridizes to one sequence and another that hybridizes to the complement of the other sequence. These sequences may be compatible with the sequencing platform being used. These sequences do not need to be at the very end of a molecule although, in many embodiments, the sequences are within 50 nt, e.g., within 30 nt of the end of the molecule. As would be apparent, the template molecule should have a junction sequence between the first and second repeats. The junction sequence should be of at least 10 nucleotides (e.g., 10 to 100 nt). The template may contain a molecular barcode (e.g., a sample identifier or molecule identifier) at any position (outside of the repeats).

The method described above can be employed to analyze genomic DNA from virtually any organism, including, but not limited to, plants, animals (e.g., reptiles, mammals, insects, worms, fish, etc.), tissue samples, bacteria, fungi (e.g., yeast), phage, viruses, cadaveric tissue, archaeological/ancient samples, etc. In certain embodiments, the genomic DNA used in the method may be derived from a mammal, wherein in certain embodiments the mammal is a human. In exemplary embodiments, the sample may contain genomic DNA from a mammalian cell, such as, a human, mouse, rat, or monkey cell. The sample may be made from cultured cells or cells of a clinical sample, e.g., a tissue biopsy, scrape or lavage or cells of a forensic sample (i.e., cells of a sample collected at a crime scene). In particular embodiments, the nucleic acid sample may be obtained from a biological sample such as cells, tissues, bodily fluids, and stool. Bodily fluids of interest include but are not limited to, blood, serum, plasma, saliva, mucus, phlegm, cerebral spinal fluid, pleural fluid, tears, lactal duct fluid, lymph, sputum, synovial fluid, urine, amniotic fluid, and semen. In particular embodiments, a sample may be obtained from a subject, e.g., a human. In some embodiments, the sample comprises fragments of human genomic DNA. In some embodiments, the sample may be obtained from a cancer patient. In some embodiments, the sample may be made by extracting fragmented DNA from a patient sample, e.g., a formalin-fixed paraffin embedded tissue sample. In some embodiments, the patient sample may be a sample of cell-free “circulating” DNA from a bodily fluid, e.g., peripheral blood, e.g., from the blood of a patient or of a pregnant female. The DNA fragments used in the initial step of the method should be non-amplified DNA that has not been denatured beforehand.

The DNA in the initial sample may be made by extracting genomic DNA from a biological sample, and then fragmenting it. In some embodiments, the fragmenting may be done mechanically (e.g., by sonication, nebulization, or shearing, etc.) or using a double stranded DNA “dsDNA” fragmentase enzyme (New England Biolabs, Ipswich MA). In some of these methods (e.g., the mechanical and fragmentase methods), after the DNA is fragmented, the ends may be polished and A-tailed prior to ligation to one or more adaptors. Alternatively, the ends may be polished and ligated to adaptors in a blunt-end ligation reaction. In other embodiments, the DNA in the initial sample may already be fragmented (e.g., as is the case for FFPE (formalin-fixed paraffin embedded) samples and circulating cell-free DNA (cfDNA), e.g., ctDNA). The fragments in the initial sample may have a median size that is below 1 kb (e.g., in the range of 50 bp to 500 bp, or 80 bp to 400 bp), although fragments having a median size outside of this range may be used.

In some embodiments, the amount of DNA in a sample may be limiting. For example, the initial sample of fragmented DNA may contain less than 200 ng of fragmented human DNA, e.g., 1 pg to 20 pg, 10 pg to 200 ng, 100 pg to 200 ng, 1 ng to 200 ng or 5 ng to 50 ng, or less than 10,000 (e.g., less than 5,000, less than 1,000, less than 500, less than 100, less than 10 or less than 1) haploid genome equivalents, depending on the genome.
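
The relationship between a DNA mass and a number of haploid genome equivalents can be approximated with the Python sketch below. The haploid human genome size of about 3.1 × 10⁹ bp and the average mass of about 650 g/mol per base pair are commonly used approximations that are assumed here for illustration; they are not values taken from this disclosure, and the function name is hypothetical.

```python
AVOGADRO = 6.022e23          # molecules per mole
AVG_BP_MASS_G_PER_MOL = 650  # approximate average mass of one base pair (assumption)


def haploid_genome_equivalents(mass_ng: float,
                               genome_size_bp: float = 3.1e9) -> float:
    """Approximate the number of haploid genome copies in `mass_ng` of DNA.

    `genome_size_bp` defaults to an approximate haploid human genome size
    (assumption); pass another value for a different genome.
    """
    grams_per_genome = genome_size_bp * AVG_BP_MASS_G_PER_MOL / AVOGADRO
    return (mass_ng * 1e-9) / grams_per_genome


# Roughly 3,000 haploid human genome equivalents in 10 ng of DNA.
print(round(haploid_genome_equivalents(10.0)))
```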

In some embodiments, sample identifiers (i.e., a sequence thatidentifies the sample to which the sequence is added, which can identifythe patient, or a tissue, etc.) can be added to the polynucleotidesprior to sequencing, so that multiple (e.g., at least 2, at least 4, atleast 8, at least 16, at least 48, at least 96 or more) samples can bemultiplexed. In these embodiments, the sample identifier may be ligatedto the initial polynucleotides as part of the asymmetric adaptor, or thesample identifier may be ligated to the polynucleotides in thesub-samples, before or after amplification of those polynucleotides.Alternatively, the tag may be added by primer extension, i.e., using aprimer that has a 3′ end that hybridizes to an adaptor sequence, and a5′ tail that contains the sample identifier.

The population of direct repeat molecules may be made in a variety of different ways. These methods rely on creating circular molecules, retaining physical proximity between the two strands of one double-stranded DNA molecule, or physically isolating two strands of one double-stranded molecule, during manipulation steps. The methods also divide into strategies requiring one or more adaptor types. These methods can be done by fragmenting, polishing and then tailing the ends of the fragments before adaptor ligation. Alternatively, transposases can be used to add adaptor sequences. In some embodiments, standard transposons can be used but then modified to create a Y-shaped adaptor using oligonucleotide replacement (Grunenwald H, Baas B, Goryshin I, Zhang B, Adey A, Hu S, Shendure J, Caruccio N, Maffitt M 2011. Nextera PCR-free DNA library preparation for next-generation sequencing. [Poster presentation, AGBT 2011]; Gertz J, Varley KE, Davis NS, Baas BJ, Goryshin IY, Vaidyanathan R, Kuersten S, Myers RM 2012. Transposase mediated construction of RNA-seq libraries. Genome Res 22: 134-141).

In some embodiments, the direct repeat template may be made by (a) ligating adaptor sequences onto both ends of top and bottom strands of a population of fragments of double-stranded genomic DNA to produce double-stranded molecules comprising (i) a top strand comprising a 5′ sequence (e.g., X) at the 5′ end and a junction sequence (e.g., J) at the 3′ end; and (ii) a bottom strand comprising a 5′ sequence (e.g., Y′) at the 5′ end, and the complement of the junction sequence (J′) at the 3′ end; and (b) extending the 3′ end of the top strands (i.e., the strand that contains sequence X) using the bottom strand as a template, thereby copying the complement of the bottom strand, as well as sequences J and Y, into the same molecule as the top strand to produce a direct repeat molecule of formula: X-TOP-J-BOT′-Y, wherein, within each repeat molecule, TOP and BOT′ are amplified from opposite strands of a fragment of double-stranded genomic DNA and are identical except for positions that correspond to damaged nucleotides in the double-stranded fragment of genomic DNA or amplification errors. In these embodiments, TOP and BOT′ vary in the population and have a median length of at least 50 nucleotides; X and Y are different, non-complementary sequences of at least 10 nucleotides in length that do not vary in the population; and J is a junction sequence. Examples of this method are shown in the figures and described in greater detail below.

In some embodiments and as shown in FIGS. 4 and 5, a direct repeat molecule may be made by ligating a single adaptor onto both ends of top and bottom strands of a population of fragments of double-stranded genomic DNA, such that the individual molecules are in a covalently open circle and, in the individual molecules in the population, sequence X is added onto the 5′ end of the top strands of the fragments and sequence Y′ is ligated onto the 5′ end of the bottom strands of the fragments. This method involves extending the 3′ end of the top strands (i.e., the strand that contains sequence X) using the bottom strand as a template, thereby copying the complement of the bottom strand, as well as sequence Y, into the same molecule as the top strand. Such a molecule can be amplified using primers that have a 3′ end that is the same as or that hybridizes to sequence X and Y. An example of such a method is illustrated in FIGS. 4 and 5, where the top and bottom strands of the fragments of genomic DNA are indicated as “forward” and “reverse” respectively and sequences X and Y′ are indicated as sequences R1 and R2.

In some embodiments, the direct repeat molecules may be made by ligating a single adaptor onto both ends of top and bottom strands of a population of fragments of double-stranded genomic DNA, such that the individual molecules are in a covalently closed circle and, in the individual molecules in the population, sequence X is added onto the 5′ end of the top strands of the fragments and sequence Y′ is ligated onto the 5′ end of the bottom strands of the fragments. This method involves creating one or more nicks by reacting, e.g., an adaptor containing dUTP with a mixture of UDG/endonuclease IV, and then extending the 3′ end of the top strands (i.e., the strand that contains sequence X) using the bottom strand as a template, thereby copying the complement of the bottom strand, as well as sequence Y, into the same molecule as the top strand. Such a molecule can be amplified using primers that have a 3′ end that is the same as or that hybridizes to sequence X and Y.

A similar product may be made by emulsion PCR, using an immobilization approach, or rolling circle amplification, single adapter methods and greater than 1 adapter methods, as described in WO2018229547, which is incorporated by reference in its entirety. In some embodiments, the direct repeat template may be of the formula X-TOP-J-BOT′-Y, wherein (i) within each repeat molecule TOP and BOT′ are amplified from opposite strands of a double-stranded fragment of genomic DNA and are identical except for positions that correspond to damaged nucleotides in the double-stranded fragment of genomic DNA or errors that occur during amplification; (ii) TOP and BOT′ have a median length of at least 50 nucleotides; (iii) X and Y are different, non-complementary sequences of at least 10 nucleotides; and (iv) J is a junction sequence of, e.g., at least 10 nucleotides in length. In some embodiments, the direct repeat template may have a strand of the formula X-(T)TOP(A)-J-(T)BOT′(A)-Y, wherein (T) and (A) are thymine and adenine nucleotides that are immediately adjacent to TOP and BOT′. Such molecules may be made by, for example, (a) ligating adaptor sequences onto both ends of top and bottom strands of a population of fragments of double-stranded genomic DNA to produce double-stranded molecules comprising: (i) a top strand comprising sequence X at the 5′ end and sequence J at the 3′ end; and (ii) a bottom strand comprising sequence Y′ at the 5′ end, and sequence J′ at the 3′ end; and (b) extending the 3′ end of the top strands using the bottom strands as a template, thereby adding the complement of the bottom strands and sequence Y onto the 3′ end of the top strands. This method is illustrated in FIGS. 4 and 5.
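The layout of a template of the formula X-TOP-J-BOT′-Y can be modeled in silico as a simple string concatenation. The sketch below is a toy model under stated assumptions, not a description of any laboratory step: the X, J and Y sequences are arbitrary 10-nucleotide placeholders, TOP and BOT′ are much shorter than the median length of at least 50 nucleotides given above, and both function names are hypothetical. It also lists the positions at which TOP and BOT′ disagree, which are the positions expected to reflect damaged nucleotides or amplification errors.

```python
def assemble_direct_repeat(x, top, j, bot_prime, y):
    """Concatenate the five segments of the formula X-TOP-J-BOT'-Y (5' to 3')."""
    return x + top + j + bot_prime + y


def differing_positions(top, bot_prime):
    """Return 0-based positions at which the two repeat sequences disagree."""
    if len(top) != len(bot_prime):
        raise ValueError("TOP and BOT' must have the same length to be compared")
    return [i for i, (a, b) in enumerate(zip(top, bot_prime)) if a != b]


if __name__ == "__main__":
    # Placeholder adaptor-derived sequences, 10 nucleotides each (assumed).
    X = "AATGATACGG"
    J = "CTGTCTCTTA"
    Y = "ATCTCGTATG"

    # Toy repeats; BOT' carries one mismatch standing in for a damaged base.
    TOP = "ACGTTGCA"
    BOT_PRIME = "ACGTTGTA"

    template = assemble_direct_repeat(X, TOP, J, BOT_PRIME, Y)
    print(template)
    print(differing_positions(TOP, BOT_PRIME))
```

Because both repeats start the same distance downstream of their flanking sequences, position i of TOP and position i of BOT′ correspond to the same nucleotide of the original fragment, which is why a position-by-position comparison is sufficient in this model.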

Kits

Also provided by this disclosure is a kit for practicing the subject method, as described above. The various components of the kit may be present in separate containers or certain compatible components may be pre-combined into a single container, as desired.

In addition to the above-mentioned components, the subject kits may further include instructions for using the components of the kit to practice the subject methods, i.e., to provide instructions for sample analysis. The instructions for practicing the subject methods are generally recorded on a suitable recording medium. For example, the instructions may be printed on a substrate, such as paper or plastic, etc. As such, the instructions may be present in the kits as a package insert, in the labeling of the container of the kit or components thereof (i.e., associated with the packaging or subpackaging) etc. In other embodiments, the instructions are present as an electronic storage data file present on a suitable computer readable storage medium, e.g., CD-ROM, diskette, etc. In yet other embodiments, the actual instructions are not present in the kit, but means for obtaining the instructions from a remote source, e.g., via the internet, are provided. An example of this embodiment is a kit that includes a web address where the instructions can be viewed and/or from which the instructions can be downloaded. As with the instructions, this means for obtaining the instructions is recorded on a suitable substrate.

Utility

As would be readily apparent, the method described above may be employed to analyze any type of sample, including, but not limited to, samples that contain heritable mutations, samples that contain somatic mutations, samples from mosaic individuals, pregnant females (in which some of the sample contains DNA from a developing fetus), and samples that contain a mixture of DNA from different sources. In certain embodiments, the method may be used to identify a minority variant that, in some cases, may be due to a somatic mutation in a person.

In some embodiments, the method may be employed to detect an oncogenic mutation (which may be a somatic mutation) in, e.g., PIK3CA, NRAS, KRAS, JAK2, HRAS, FGFR3, FGFR1, EGFR, CDK4, BRAF, RET, PDGFRA, KIT or ERBB2, which may be associated with breast cancer, melanoma, renal cancer, endometrial cancer, ovarian cancer, pancreatic cancer, leukemia, colorectal cancer, prostate cancer, mesothelioma, glioma, medulloblastoma, polycythemia, lymphoma, sarcoma or multiple myeloma (see, e.g., Chial 2008 Proto-oncogenes to oncogenes to cancer. Nature Education 1:1). Other oncogenic mutations (which may be somatic mutations) of interest include mutations in, e.g., APC, AXIN2, CDH1, GPC3, CYLD, EXT1, EXT2, PTCH, SUFU, FH, SDHB, SDHC, SDHD, VHL, TP53, WT1, STK11/LKB1, PTEN, TSC1, TSC2, CDKN2A, CDK4, RB1, NF1, BMPR1A, MEN1, SMAD4, BHD, HRPT2, NF2, MUTYH, ATM, BLM, BRCA1, BRCA2, FANCA, FANCC, FANCD2, FANCE, FANCF, FANCG, NBS1, RECQL4, WRN, MSH2, MLH1, MSH6, PMS2, XPA, XPC, ERCC2-5, DDB2 or MET, which may be associated with colon, thyroid, parathyroid, pituitary, islet cell, stomach, intestinal, embryonal, bone, renal, breast, brain, ovarian, pancreatic, uterine, eye, hair follicle or blood cancers, pilotrichomas, medulloblastomas, leiomyomas, paragangliomas, pheochromocytomas, hamartomas, gliomas, fibromas, neuromas, lymphomas or melanomas. In some embodiments, the method may be employed to detect a somatic mutation in genes that are implicated in cancer, e.g., CTNNB1, BCL2, TNFRSF6/FAS, BAX, FBXW7/CDC4, GLI, HPVE6, MDM2, NOTCH1, AKT2, FOXO1A, FOXO3A, CCND1, HPVE7, TAL1, TFE3, ABL1, ALK, EPHB2, FES, FGFR2, FLT3, FLT4, KRAS2, NTRK1, NTRK3, PDGFB, PDGFRB, EWSR1, RUNX1, SMAD2, TGFBR1, TGFBR2, BCL6, EVI1, HMGA2, HOXA9, HOXA11, HOXA13, HOXC13, HOXD11, HOXD13, HOX11, HOX11L2, MAP2K4, MLL, MYC, MYCN, MYCL1, PTPN1, PTPN11, RARA or SS18 (see, e.g., Vogelstein and Kinzler 2004 Cancer genes and the pathways they control. Nature Medicine 10:789-799). The method may be employed to detect any somatic mutation that is implicated in cancer and is catalogued by COSMIC (Catalogue of Somatic Mutations in Cancer), data of which can be accessed on the internet.

Other mutations of interest include mutations in, e.g., ARID1A, ARID1B, SMARCA4, SMARCB1, SMARCE1, AKT1, ACTB/ACTG1, CHD7, ANKRD11, SETBP1, MLL2, ASXL1, which may be at least associated with rare syndromes such as Coffin-Siris syndrome, Proteus syndrome, Baraitser-Winter syndrome, CHARGE syndrome, KBG syndrome, Schinzel-Giedion syndrome, Kabuki syndrome or Bohring-Opitz syndrome (see, e.g., Veltman and Brunner 2012 De novo mutations in human genetic disease. Nature Reviews Genetics 13:565-575). Hence, the method may be employed to detect a mutation in those genes.

In other embodiments, the method may be employed to detect a mutation in genes that are implicated in a variety of neurodevelopmental disorders, e.g., KAT6B, THRA, EZH2, SRCAP, CSF1R, TRPV3, DNMT1, EFTUD2, SMAD4, LIS1, DCX, which may be associated with Ohdo syndrome, hypothyroidism, Genitopatellar syndrome, Weaver syndrome, Floating-Harbor syndrome, hereditary diffuse leukoencephalopathy with spheroids, Olmsted syndrome, ADCA-DN (autosomal-dominant cerebellar ataxia, deafness and narcolepsy), mandibulofacial dysostosis with microcephaly or Myhre syndrome (see, e.g., Ku et al. (2012) A new paradigm emerges from study of de novo mutations in the context of neurodevelopmental disease. Molecular Psychiatry 18:141-153). The method may also be employed to detect a somatic mutation in genes that are implicated in a variety of neurological and neurodegenerative disorders, e.g., SCN1A, MECP2, IKBKG/NEMO or PRNP (see, e.g., Poduri et al. (2014) Somatic mutation, genetic variation, and neurological disease. Science 341(6141):1237758).

In some embodiments, a sample may be collected from a patient at a first location, e.g., in a clinical setting such as in a hospital or at a doctor’s office, and the sample may be forwarded to a second location, e.g., a laboratory where it is processed, and the above-described method is performed to generate a report. A “report”, as described herein, is an electronic or tangible document which includes report elements that provide test results that may indicate the presence and/or quantity of minority variant(s) in the sample. Once generated, the report may be forwarded to another location (which may be the same location as the first location), where it may be interpreted by a health professional (e.g., a clinician, a laboratory technician, or a physician such as an oncologist, surgeon, pathologist or virologist), as part of a clinical decision.

The method may be used to analyze diseases that are associated with mutations, to analyze transplant rejection, and in non-invasive prenatal testing.

Accordingly, the preceding merely illustrates the principles of the invention. It will be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the invention and are included within its spirit and scope. Furthermore, all examples and conditional language recited herein are principally intended to aid the reader in understanding the principles of the invention and the concepts contributed by the inventors to furthering the art and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents and equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure. The scope of the present invention, therefore, is not intended to be limited to the exemplary embodiments shown and described herein. Rather, the scope and spirit of the present invention is embodied by the appended claims.

1. A method of sequencing a template that comprises a first repeat sequence and a second repeat sequence, wherein the first and second repeat sequences are in a direct repeat and either identical or nearly identical, comprising: (a) in the same reaction, hybridizing a primer to a first site that is upstream of the first repeat sequence and hybridizing a primer to a second site that is upstream of the second repeat sequence, wherein the first and second sites are: (i) upstream of the first and second repeat sequences, respectively, and (ii) equidistant from the first and second repeat sequences; and (b) subjecting the hybridization product of (a) to a sequencing-by-synthesis sequencing reaction to produce a sequence read that comprises a combination of the first and second repeat sequences.
2. The method of claim 1, wherein within each template the first repeat sequence and the second repeat sequence are amplified from opposite strands of a double-stranded fragment of DNA and are identical except for positions that correspond to damaged nucleotides in the double-stranded fragment of DNA or errors that occur during amplification.
3. The method of claim 2, wherein the double-stranded fragment of DNA is genomic DNA.
4. The method of claim 3, wherein the genomic DNA is eukaryotic genomic DNA.
5. The method of claim 3, wherein the genomic DNA is isolated from a tissue biopsy.
6. The method of claim 3, wherein the genomic DNA is cell-free DNA (cfDNA).
7. The method of claim 3, wherein the genomic DNA is microbial genomic DNA.
8. The method of claim 3, wherein the genomic DNA is viral genomic DNA.
9. The method of claim 1, wherein the sequence read of (b) comprises, for each position of the sequence read, a quality score indicating the reliability of the base(s) called at that position.
10. The method of claim 9, wherein a position in the sequence read that is uncalled or associated with a low-quality score indicates that first and second repeat sequences differ at a nucleotide that corresponds to that position.
11. The method of claim 10, further comprising analyzing primary sequencing data for a position that has a low-quality score to determine the identities of the nucleotides at that position in the first and second repeats.
12. The method of claim 1, wherein step (b) comprises: (i) reading a combination of signals obtained by simultaneous extension of the first and second primers to produce primary sequencing data; (ii) processing the primary sequencing data using a base-calling algorithm to produce a sequence read composed of a sequence of base calls, each base call associated with a quality score indicating the reliability of the base call; and (iii) outputting the sequence read based on (ii).
13. The method of claim 1, wherein the sequencing-by-synthesis of step (b) comprises simultaneously extending the first and second primers in the presence of reversible chain terminators.
14. The method of claim 1, wherein the first and second sites in the template are the same sequence.
15. The method of claim 1, wherein the first and second sites in the template are different sequences.
16. The method of claim 1, wherein the template comprises: (i) a first calibrator sequence that is present between the first site and the first repeat; and (ii) a second calibrator sequence that is present between the second site and the second repeat, wherein the first and second calibrator sequences are the same length and have a different sequence; and the sequence read of step (b) includes positions that correspond to the first and second calibrator sequences.
17. The method of claim 16, further comprising analyzing the signals corresponding to the first and second calibrator sequences to determine how many strands of the first and second repeats are sequenced in the reaction.
18. The method of claim 17, further comprising analyzing the signals corresponding to the first and second calibrator sequences to determine if a sufficient number of molecules have been sequenced.
19. The method of claim 1, wherein first and second repeats are less than 2,000 nucleotides in length.
20. The method of claim 1, wherein the method is done by: amplifying the template on a substrate by bridge PCR to produce a colony that comprises copies of the template; hybridizing one or more primers to the colony, wherein a primer hybridizes to a first site that is upstream of the first repeat sequence and a primer hybridizes to a second site that is upstream of the second repeat sequence, wherein the first and second sites are: upstream of the first and second repeat sequences, respectively, and equidistant from the first and second repeat sequences; and obtaining the sequence of the template by a sequencing-by-synthesis sequencing reaction to produce a sequence read that comprises a combination of the first and second repeat sequences.